The paper advocates the use of a statistical tool dedicated to the exploration of data samples populated by several
sources of events. This new technique, called sPlot, is able to unfold the contributions of the different sources to the
distribution of a data sample in a given variable. The sPlot tool applies in the context of a Likelihood fit which is
performed on the data sample to determine the yields of the various sources.



1 Introduction 



This paper describes a new technique to explore a data sample when the latter consists of several sources of events merged into a single sample of events. The events are assumed to be characterized by a set of variables which can be split into two components. The first component is a set of variables for which the distributions of all the sources of events are known: below, these variables are referred to as the discriminating variables. The second component is a set of variables for which the distributions of some sources of events are either truly unknown or considered as such: below, these variables are referred to as the control variables.

The new technique, termed sPlot^a, allows one to reconstruct the distributions for the control variable, independently for each of the various sources of events, without making use of any a priori knowledge on this variable. The aim is thus to use the knowledge available for the discriminating variables to infer the behavior of the individual sources of events with respect to the control variable. An essential assumption for the sPlot technique to apply is that the control variable is uncorrelated with the discriminating variables.

The sPlot technique is developed in the context of a maximum Likelihood method making use of the discriminating variables. Section 2 is dedicated to the definition of the fundamental objects needed in what follows. Section 3 presents an intermediate technique, simpler but inadequate, which is a first step towards the sPlot technique. The sPlot formalism is then developed in Section 4 and its properties are explained in Section 5. An example of sPlot at work is provided in Section 6 and some applications are described in Section 7. Finally, the case where the control variable is correlated with the discriminating ones is discussed in Section 8.

2 Basics and definitions 

One considers an unbinned extended maximum Like- 
lihood analysis of a data sample in which are merged 
several species (signal and background) of events. 
The log-Likelihood is expressed as: 



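$$
\mathcal{L} = \sum_{e=1}^{N} \ln\Big\{ \sum_{i=1}^{N_s} N_i f_i(y_e) \Big\} - \sum_{i=1}^{N_s} N_i \tag{1}
$$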



where 



• $N$ is the total number of events considered,

• $N_s$ is the number of species of events populating the data sample,

• $N_i$ is the (non-integral) number of events expected on the average for the $i$-th species,

• $y$ represents the set of discriminating variables, which can be correlated with each other,

• $f_i(y_e)$ is the value of the Probability Density Function (pdf) of $y$ for the $i$-th species and for event $e$.

The log-Likelihood $\mathcal{L}$ is a function of the $N_s$ yields $N_i$ and, possibly, of implicit free parameters designed to tune the pdfs on the data sample. These parameters as well as the yields $N_i$ are determined by maximizing the above log-Likelihood.
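As a minimal sketch of the extended log-Likelihood of Eq. (1), the function below evaluates $\sum_e \ln\{\sum_i N_i f_i(y_e)\} - \sum_i N_i$ for given yields, assuming the pdf values $f_i(y_e)$ have already been computed for every event. The function name, the list-of-lists layout and the numerical values are illustrative assumptions, not part of the original analysis.

```python
import math

def extended_log_likelihood(yields, pdf_values):
    """Extended log-Likelihood: sum_e ln(sum_i N_i f_i(y_e)) - sum_i N_i.

    yields[i]        -- N_i, expected number of events of species i
    pdf_values[e][i] -- f_i(y_e), pdf of species i evaluated at event e
    """
    log_sum = 0.0
    for f_e in pdf_values:
        density = sum(n_i * f_i for n_i, f_i in zip(yields, f_e))
        log_sum += math.log(density)
    return log_sum - sum(yields)

# Three hypothetical events and two species; the pdf values are made up.
pdf_values = [[0.5, 1.0], [1.0, 0.5], [0.25, 0.25]]
value = extended_log_likelihood([2.0, 1.0], pdf_values)
print(value)
```

In a real fit this function would be handed to a maximizer, with the yields (and any shape parameters hidden inside the pdfs) as free parameters.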

The crucial point for the reliability of such an analysis is to use an exhaustive list of sources of events combined with an accurate description of all the pdfs $f_i$. If the distributions of the control variables are known (resp. unknown) for a particular source of events, one would like to compare the expected distribution for this source to the one extracted from the data sample (resp. determine the distribution for this source).

^a The sPlot technique is the subject of a publication where details of the calculations and more examples can be found.

The control variable $x$ which, by definition, does not explicitly appear in the expression of $\mathcal{L}$, can be:

1. totally correlated with the discriminating variables $y$ ($x$ belongs to the set $y$ for example). This is the case treated in Section 3.

2. uncorrelated with $y$. This is the subject of Section 4.

3. partly correlated with $y$. This case is discussed in Section 8.

In an attempt to have access to the distributions of control variables, a common method consists of applying cuts which are designed to enhance the contributions to the data sample of particular sources of events. However, the result is frequently unsatisfactory: firstly because it can be used only if the signal has prominent features to be distinguished from the background, and secondly because, owing to the cuts applied, a sizeable fraction of signal events can be lost while a large fraction of background events may remain.

The aim of the sPlot formalism developed in this paper is to unfold the true distribution (denoted in boldface $\mathbf{M}_n(x)$) of a control variable $x$ for events of the $n$-th species (any one of the $N_s$ species), from the sole knowledge of the pdfs of the discriminating variables $f_i$, the first step being to proceed to the maximum Likelihood fit to extract the yields $N_i$. The sPlot statistical technique allows one to build histograms in $x$ keeping all signal events while getting rid of all background events, and keeping track of the statistical uncertainties per bin in $x$.

3 First step towards sPlot: inPlot

In this Section, as a means of introduction, one considers a variable $x$ assumed to be totally correlated with $y$: $x$ is a function of $y$. A fit having been performed to determine the yields $N_i$ for all species, one can define naively, for all events, the weight

$$
\mathcal{P}_n(y_e) = \frac{N_n f_n(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} \tag{2}
$$



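which can be used to build an estimate, denoted $\tilde{M}_n$, of the $x$-distribution of the species labelled $n$ (signal or background):

$$
N_n \tilde{M}_n(\bar{x})\,\delta x \equiv \sum_{e \subset \delta x} \mathcal{P}_n(y_e) \tag{3}
$$

where the sum runs over the events for which the $x$ value lies in the bin centered on $\bar{x}$ and of total width $\delta x$.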
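The naive weight and the weighted histogram it feeds can be sketched as follows. This is an illustration only: the weight is $N_n f_n(y_e) / \sum_k N_k f_k(y_e)$, the histogram sums these weights bin by bin, and the function names, the half-open bin convention and all numerical values are assumptions made for the example.

```python
def naive_weights(yields, pdf_values, n):
    """Naive weight of species n for every event:
    N_n f_n(y_e) / sum_k N_k f_k(y_e)."""
    weights = []
    for f_e in pdf_values:
        total = sum(n_k * f_k for n_k, f_k in zip(yields, f_e))
        weights.append(yields[n] * f_e[n] / total)
    return weights

def weighted_histogram(x_values, weights, edges):
    """Sum the weights of the events falling in each bin
    [edges[b], edges[b+1]); events outside the range are ignored."""
    contents = [0.0] * (len(edges) - 1)
    for x, w in zip(x_values, weights):
        for b in range(len(contents)):
            if edges[b] <= x < edges[b + 1]:
                contents[b] += w
                break
    return contents

# Hypothetical fitted yields, pdf values f_i(y_e) and x values for three events.
yields = [2.0, 1.0]
pdf_values = [[0.5, 1.0], [1.0, 0.5], [0.25, 0.25]]
x_values = [0.2, 0.7, 0.8]

w0 = naive_weights(yields, pdf_values, 0)
hist0 = weighted_histogram(x_values, w0, [0.0, 0.5, 1.0])
print(w0, hist0)
```

For any event the naive weights of all species add up to one, so they can be read as per-event probabilities of belonging to each species given $y_e$.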

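In other words, $N_n \tilde{M}_n(\bar{x})\,\delta x$ is the $x$-distribution obtained by histogramming events, using the weight of Eq. (2). To obtain the expectation value of $\tilde{M}_n$, one should replace the sum in Eq. (3) by the integral

$$
\Big\langle \sum_{e \subset \delta x} \Big\rangle \longrightarrow \int dy \sum_{j=1}^{N_s} N_j f_j(y)\, \delta(x(y) - \bar{x})\, \delta x \tag{4}
$$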



Similarly, identifying the number of events $N_i$ as determined by the fit with the expected number of events, one readily obtains:

$$
\big\langle N_n \tilde{M}_n(\bar{x}) \big\rangle = N_n \mathbf{M}_n(\bar{x}) \tag{5}
$$
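The intermediate step behind Eq. (5) can be spelled out as follows. This is a sketch in which $\mathcal{P}_n(y) = N_n f_n(y) / \sum_k N_k f_k(y)$ is the naive weight, $\tilde{M}_n$ is the histogram estimate built from it, and the true distribution $\mathbf{M}_n$ is taken to be the projection of $f_n(y)$ onto $x = x(y)$; the total density $\sum_j N_j f_j(y)$ cancels against the denominator of the weight.

```latex
\big\langle N_n \tilde{M}_n(\bar{x}) \big\rangle
  = \int dy \sum_{j=1}^{N_s} N_j f_j(y)\,
    \delta\big(x(y) - \bar{x}\big)\,
    \frac{N_n f_n(y)}{\sum_{k=1}^{N_s} N_k f_k(y)}
  = N_n \int dy\, \delta\big(x(y) - \bar{x}\big)\, f_n(y)
  \equiv N_n \mathbf{M}_n(\bar{x})
```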

Therefore, the sum over events of the naive weight reproduces, on average, the true distribution $\mathbf{M}_n(x)$. Plots obtained that way are referred to as inPlots: they provide a correct means to reconstruct $\mathbf{M}_n(x)$ only insofar as the variable considered is in the set of discriminating variables $y$. These inPlots suffer from a major drawback: $x$ being fully correlated to $y$, the pdfs of $x$ enter implicitly in the definition of the naive weight, and as a result, the $\tilde{M}_n$ distributions cannot be used easily to assess the quality of the fit, because these distributions are biased in a way difficult to grasp, when the pdfs $f_i(y)$ are not accurate. For example, let us consider a situation where, in the data sample, some events from the $n$-th species show up far in the tail of the $M_n(x)$ distribution which is implicitly used in the fit. The presence of such events implies that the true distribution $\mathbf{M}_n(x)$ must exhibit a tail which is not accounted for by $M_n(x)$. These events would enter in the reconstructed inPlot $\tilde{M}_n$ with a very small weight, and they would thus escape detection by the above procedure: $\tilde{M}_n$ would be close to $M_n$, the distribution assumed for $x$. Only a mismatch in the core of the $x$-distribution can be revealed with inPlots. Stated differently, the error bars which can be attached to each individual bin of $\tilde{M}_n$ cannot account for the systematic bias inherent to the inPlots.

^b Removing one of the discriminating variables from the set $y$ before performing again the maximum Likelihood fit, one can consider the removed variable as a control variable $x$, provided it is uncorrelated with the others.

4 The sPlot formalism

In this Section one considers the more interesting case where the two sets of variables $x$ and $y$ are uncorrelated. Hence, the total pdfs $f_i(x, y)$ all factorize into products $\mathbf{M}_i(x) f_i(y)$. While performing the fit, which relies only on $y$, no a priori knowledge of the $x$-distributions is used.

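One may still consider the above distribution $\tilde{M}_n$ (Eq. (3)), using the naive weight of Eq. (2). However, in that case, $\tilde{M}_n$ is a biased estimator of $\mathbf{M}_n$: its expectation value is

$$
\big\langle N_n \tilde{M}_n(\bar{x}) \big\rangle
= \iint dy\, dx \sum_{j=1}^{N_s} N_j \mathbf{M}_j(x) f_j(y)\, \delta(x - \bar{x})\, \mathcal{P}_n
= N_n \sum_{j=1}^{N_s} \mathbf{M}_j(\bar{x}) \Big( N_j \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} \Big)
\neq N_n \mathbf{M}_n(\bar{x}) \tag{6}
$$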



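Here, the naive weight is no longer satisfactory because, when summing over the events, the $x$-pdfs $\mathbf{M}_j(x)$ appear now on the right hand side of Eq. (4), while they are absent in the weight. However, one observes that the correction term in the right hand side of Eq. (6) is related to the inverse of the covariance matrix, given by the second derivatives of $-\mathcal{L}$:

$$
\mathbf{V}^{-1}_{nj} = \frac{\partial^2(-\mathcal{L})}{\partial N_n\, \partial N_j} = \sum_{e=1}^{N} \frac{f_n(y_e) f_j(y_e)}{\big(\sum_{k=1}^{N_s} N_k f_k(y_e)\big)^2} \tag{7}
$$

On average, one gets:

$$
\big\langle \mathbf{V}^{-1}_{nj} \big\rangle = \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} \tag{8}
$$

Therefore, Eq. (6) can be rewritten:

$$
\big\langle \tilde{M}_n(\bar{x}) \big\rangle = \sum_{j=1}^{N_s} \mathbf{M}_j(\bar{x})\, N_j\, \big\langle \mathbf{V}^{-1}_{nj} \big\rangle \tag{9}
$$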



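Inverting this matrix equation, one recovers the distribution of interest:

$$
N_n \mathbf{M}_n(\bar{x}) = \sum_{j=1}^{N_s} \big\langle \mathbf{V}_{nj} \big\rangle \big\langle \tilde{M}_j(\bar{x}) \big\rangle \tag{10}
$$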



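Hence, when $x$ is uncorrelated with the set $y$, the appropriate weight is not given by Eq. (2), but is the covariance-weighted quantity (hereafter called sWeight) defined by:

$$
{}_s\mathcal{P}_n(y_e) = \frac{\sum_{j=1}^{N_s} \mathbf{V}_{nj} f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} \tag{11}
$$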



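With this sWeight, the distribution of the control variable $x$ can be obtained from the sPlot histogram:

$$
N_n\, {}_s\tilde{M}_n(\bar{x})\,\delta x \equiv \sum_{e \subset \delta x} {}_s\mathcal{P}_n(y_e) \tag{12}
$$

which reproduces, on average, the true binned distribution:

$$
\big\langle N_n\, {}_s\tilde{M}_n(x) \big\rangle = N_n \mathbf{M}_n(x) \tag{13}
$$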



The fact that the covariance matrix $\mathbf{V}_{nj}$ enters in the definition of the sWeights is enlightening: in particular, the sWeight can be positive or negative, and the estimators of the true pdfs are not constrained to be strictly positive.
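The whole chain (fit, covariance matrix, sWeights) can be sketched on a toy sample. Everything below is an illustrative assumption rather than the original implementation: two species only, Gaussian pdfs with fixed shapes, a simple fixed-point iteration standing in for a general-purpose minimizer, and hypothetical function names. The inverse covariance is built as $\mathbf{V}^{-1}_{nj} = \sum_e f_n(y_e) f_j(y_e) / (\sum_k N_k f_k(y_e))^2$, and the sWeight as $\sum_j \mathbf{V}_{nj} f_j(y_e) / \sum_k N_k f_k(y_e)$.

```python
import math
import random

def gauss_pdf(v, mu, sigma):
    """Normal density, used here as a stand-in for the known pdfs f_i(y)."""
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_yields(pdf_values, n_iter=200):
    """Maximize the extended Likelihood over the yields only, iterating
    N_i <- sum_e N_i f_i(y_e) / sum_k N_k f_k(y_e) to its fixed point."""
    n_species = len(pdf_values[0])
    yields = [len(pdf_values) / n_species] * n_species
    for _ in range(n_iter):
        new = [0.0] * n_species
        for f_e in pdf_values:
            total = sum(n_k * f_k for n_k, f_k in zip(yields, f_e))
            for i in range(n_species):
                new[i] += yields[i] * f_e[i] / total
        yields = new
    return yields

def covariance_2x2(yields, pdf_values):
    """Build the inverse covariance for two species, then invert it explicitly."""
    a = b = d = 0.0  # independent entries of the symmetric matrix V^-1
    for f0, f1 in pdf_values:
        denom = (yields[0] * f0 + yields[1] * f1) ** 2
        a += f0 * f0 / denom
        b += f0 * f1 / denom
        d += f1 * f1 / denom
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def sweights(yields, cov, pdf_values, n):
    """sWeight of species n: sum_j V_nj f_j(y_e) / sum_k N_k f_k(y_e)."""
    out = []
    for f_e in pdf_values:
        total = sum(n_k * f_k for n_k, f_k in zip(yields, f_e))
        out.append(sum(cov[n][j] * f_e[j] for j in range(len(yields))) / total)
    return out

# Toy sample in y: a narrow species 0 on top of a broad species 1.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]
sample += [random.gauss(0.0, 4.0) for _ in range(700)]
pdf_values = [(gauss_pdf(v, 0.0, 1.0), gauss_pdf(v, 0.0, 4.0)) for v in sample]

yields = fit_yields(pdf_values)
cov = covariance_2x2(yields, pdf_values)
sw_signal = sweights(yields, cov, pdf_values, 0)
print(yields, min(sw_signal), max(sw_signal))
```

Events far in the tails of the broad species receive a negative sWeight for species 0, since the off-diagonal covariance element is negative. Note that the matrix built this way coincides with the covariance of the fit only when the yields are the sole free parameters.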



5 sPlot properties



Besides satisfying the essential asymptotic property of Eq. (13), sPlots bear properties which hold for finite statistics.

The distribution ${}_s\tilde{M}_n$ defined by Eq. (12) is guaranteed to be normalized to unity and the sum over the species of the sPlots reproduces the data sample distribution of the control variable. These properties rely on maximizing the Likelihood:

• Each $x$-distribution is properly normalized. The sum over the $x$-bins of $N_n\, {}_s\tilde{M}_n\,\delta x$ is equal to $N_n$:

$$
\sum_{e=1}^{N} {}_s\mathcal{P}_n(y_e) = N_n \tag{14}
$$



• In each bin, the sum over all species of the expected numbers of events is equal to the number of events actually observed. In effect, for any event:

$$
\sum_{l=1}^{N_s} {}_s\mathcal{P}_l(y_e) = 1 \tag{15}
$$
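The two identities above can be derived in a few lines. The sketch below assumes that the yields are the only free parameters of the fit (or that any other parameters are held fixed when the covariance matrix is computed), and writes the inverse covariance and the sWeight explicitly so that the derivation is self-contained.

```latex
% Stationarity of the log-Likelihood with respect to each yield N_j:
\frac{\partial\mathcal{L}}{\partial N_j} = 0
\;\Longrightarrow\;
\sum_{e=1}^{N} \frac{f_j(y_e)}{\sum_{k} N_k f_k(y_e)} = 1 .

% Contract the inverse covariance with the yields and use the line above:
\sum_{i=1}^{N_s} N_i\, \mathbf{V}^{-1}_{ij}
  = \sum_{e=1}^{N} \frac{f_j(y_e) \sum_i N_i f_i(y_e)}{\big(\sum_k N_k f_k(y_e)\big)^2}
  = 1
\;\Longrightarrow\;
\sum_{j=1}^{N_s} \mathbf{V}_{nj} = N_n .

% Normalization: sum the sWeight over all events
\sum_{e=1}^{N} {}_s\mathcal{P}_n(y_e)
  = \sum_{j} \mathbf{V}_{nj} \sum_{e} \frac{f_j(y_e)}{\sum_k N_k f_k(y_e)}
  = \sum_{j} \mathbf{V}_{nj} = N_n .

% Closure: sum the sWeight over all species for a fixed event
\sum_{l=1}^{N_s} {}_s\mathcal{P}_l(y_e)
  = \frac{\sum_j \big(\sum_l \mathbf{V}_{lj}\big) f_j(y_e)}{\sum_k N_k f_k(y_e)}
  = \frac{\sum_j N_j f_j(y_e)}{\sum_k N_k f_k(y_e)} = 1 .
```

Both results hinge on the first line, which is why they hold only at the maximum of the Likelihood.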



Therefore, an sPlot provides a consistent representation of how all events from the various species are distributed in the control variable $x$. Summing up the $N_s$ sPlots, one recovers the data sample distribution in $x$, and summing up the number of events entering in an sPlot for a given species, one recovers the yield of the species, as it is provided by the fit. For instance, if one observes an excess of events for a particular $n$-th species, in a given $x$-bin, this excess is effectively accounted for in the number of events $N_n$ resulting from the fit. To remove these events implies a corresponding decrease in $N_n$. It remains to gauge how significant an anomaly in the $x$-distribution of the $n$-th species is.

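The statistical uncertainty on $N_n\, {}_s\tilde{M}_n(x)\,\delta x$ can be defined in each bin by

$$
\sigma\big[N_n\, {}_s\tilde{M}_n(x)\,\delta x\big] = \sqrt{\sum_{e \subset \delta x} \big({}_s\mathcal{P}_n\big)^2} \tag{16}
$$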



The above properties Eqs. (13)-(15) are completed by the fact that the sum in quadrature of the uncertainties Eq. (16) reproduces the statistical uncertainty on the yield $N_n$, as it is provided by the fit. In effect, the sum over the $x$-bins reads:

$$
\sum_{[\delta x]} \sigma^2\big[N_n\, {}_s\tilde{M}_n\,\delta x\big] = \mathbf{V}_{nn} \tag{17}
$$



Therefore, for the expected number of events per $x$-bin indicated by the sPlots, the statistical uncertainties are straightforward to compute using Eq. (16). The latter expression is asymptotically correct, and it provides a consistent representation of how the overall uncertainty on $N_n$ is distributed in $x$ among the events of the $n$-th species. Because of Eq. (17), and since the determination of the yields is optimal when obtained using a Likelihood fit, one can conclude that the sPlot technique is itself an optimal method to reconstruct distributions of control variables.
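Filling an sPlot histogram together with its per-bin uncertainties can be sketched as below: the bin content is the sum of the sWeights of the events in the bin, and the uncertainty is the square root of the sum of their squares. The function name, the half-open bin convention and the five hand-picked sWeights are assumptions made for the illustration; in practice the sWeights come from the fit.

```python
import math

def splot_histogram(x_values, sweights, edges):
    """Per-bin sum of sWeights and the associated uncertainty
    sqrt(sum of squared sWeights), for bins [edges[b], edges[b+1])."""
    n_bins = len(edges) - 1
    contents = [0.0] * n_bins
    sum_w2 = [0.0] * n_bins
    for x, w in zip(x_values, sweights):
        for b in range(n_bins):
            if edges[b] <= x < edges[b + 1]:
                contents[b] += w
                sum_w2[b] += w * w
                break
    errors = [math.sqrt(s) for s in sum_w2]
    return contents, errors

# Hypothetical control-variable values and sWeights for five events.
x_values = [0.1, 0.4, 0.6, 0.9, 0.7]
sweights = [0.9, -0.2, 1.1, 0.3, -0.1]
contents, errors = splot_histogram(x_values, sweights, [0.0, 0.5, 1.0])
print(contents, errors)
```

Summing the squared bin uncertainties gives back the sum of all squared sWeights, which for sWeights produced by an actual fit is the variance of the yield.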

6 Illustrations 

An example of sPlot at work is taken from the analysis where the method was first used. One deals with a data sample in which three species are present: $B^0 \to \pi^+\pi^-$ and $B^0 \to K^+\pi^-$ are signals and the main background comes from $e^+e^- \to q\bar{q}$. The variable which is not incorporated in the fit is called $\Delta E$ and is used here as the control variable $x$. The detailed
description of the variables can be found in Refs. . 

The left plot of Fig. 1 shows the distribution of $\Delta E$ after applying a cut on the Likelihood ratio. Therefore, the resulting data distribution concerns a reduced subsample for which statistical fluctuations cannot be attributed unambiguously to signal or to background. For example, the excess of events appearing on the left of the peak is likely to be attributed to a harmless background fluctuation.

Figure 1. Signal distribution of the $\Delta E$ variable. The left figure is obtained applying a cut on the Likelihood ratio to enrich the data sample in signal events (about 60% of signal is kept). The right figure shows the sPlot for signal (all events are kept).



Looking at the right plot of Fig. 1, which is a signal sPlot, one can see that these events are signal events, not background events. The pdf of $\Delta E$ which is used in the conventional fit for the whole analysis is superimposed on the sPlot. When this pdf is used, the events in excess are interpreted as background events while performing the fit. Further studies have shown that these events are in fact radiative events, i.e. $B^0 \to \pi^+\pi^-\gamma$. When ignored in the analysis they lead to underestimates of the branching ratios by about 10%. The updated results for the $B^0 \to \pi^+\pi^-, K^+\pi^-$ analysis, now taking into account the contribution of radiative events, show agreement
with the estimate made in Ref. ^. 

7 Applications 

Besides providing a convenient and optimal tool to cross-check the analysis by allowing distributions of control variables to be reconstructed and then compared with expectations, the sPlot formalism can also be applied to extract physics results which would otherwise be difficult to obtain. For example, one may wish to explore some unknown physics involved in the distribution of a variable $x$. Or, one may wish to correct a particular yield provided by the Likelihood fit for a selection efficiency which is known to depend on a variable $x$, for which the pdf is unknown. Both uses are possible provided one can demonstrate (e.g. through Monte-Carlo simulations) that the variable $x$ exhibits weak correlation with the discriminating variables $y$.

To be specific, one can take the example of a three body decay analysis of a species, the signal, polluted by background. The signal pdf inside the two-dimensional Dalitz plot is assumed to be not known, because of unknown contributions of resonances, continuum and of interference pattern. Since the $x$-dependence of the selection efficiency $\epsilon(x)$ can be computed without a priori knowledge of the $x$-distributions, one can build the efficiency corrected two-dimensional sPlots (cf. Eq. (12)):

$$
\frac{1}{\epsilon(\bar{x})}\, N_n\, {}_s\tilde{M}_n(\bar{x})\,\delta x = \sum_{e \subset \delta x} \frac{1}{\epsilon(x_e)}\, {}_s\mathcal{P}_n(y_e) \tag{18}
$$

and compute the efficiency corrected yields:

$$
N_n^{\epsilon} = \sum_{e=1}^{N} \frac{{}_s\mathcal{P}_n(y_e)}{\epsilon(x_e)} \tag{19}
$$

Analyses can then use the sPlot formalism for validation purposes, but also, using Eq. (18) and Eq. (19), to probe for resonance structures and to measure branching ratios.
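The efficiency corrected yield amounts to summing, over all events, the sWeight divided by the selection efficiency evaluated at the event's own $x$. A minimal sketch, in which the function name and the per-event numbers are hypothetical:

```python
def efficiency_corrected_yield(sweights, efficiencies):
    """Sum over events of sWeight / efficiency, with the efficiency
    evaluated at the control-variable value of each event."""
    return sum(w / eps for w, eps in zip(sweights, efficiencies))

sweights = [0.9, -0.2, 1.1, 0.3, -0.1]    # hypothetical sWeights of one species
efficiencies = [0.5, 0.8, 1.0, 0.6, 0.4]  # hypothetical efficiency at each event's x

corrected = efficiency_corrected_yield(sweights, efficiencies)
print(sum(sweights), corrected)
```

No model of the $x$-distribution is needed: the correction is applied event by event, which is what makes the approach usable when the signal pdf in $x$ is unknown.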

8 Correlation between variables 

Correlations between variables, if not trivial, are usually assessed by Monte-Carlo simulations. In case significant correlations are observed, one may still use the sWeight of Eq. (11), but then there is a caveat. The distribution obtained with sPlot cannot be compared directly with the marginal distribution of $x$. In that case, one must rely on Monte-Carlo simulation, and apply the sPlot technique to the simulated events, in order to obtain Monte-Carlo sPlots. It is these Monte-Carlo sPlots which are to be compared to the sPlot obtained with the real data. Stated differently, the sPlot can still be applied to compare the behavior of the data with the Monte-Carlo expected behavior, but it loses its simplicity.

9 Conclusion 

The technique presented in this paper applies when 



• one examines a data sample originating from dif- 
ferent sources of events, 

• a Likelihood fit is performed on the data sample 
to determine the yields of the sources, 

• this Likelihood uses a set y of discriminating 
variables, 

• a control variable $x$, statistically uncorrelated with the set $y$, is kept aside.

By building sPlots, one can reconstruct the distributions of the control variable $x$, separately for each source present in the data sample. Although no cut is applied (hence, the sPlot of a given species represents the whole statistics of this species) the distributions obtained are pure in a statistical sense: they are free from the potential background arising from the other species. The more discriminating the variables $y$, the clearer the sPlot is. The technique is straightforward to implement; it is available in the ROOT framework under the class TSPlot. It features several nice properties: both the normalizations and the statistical uncertainties of the sPlots reflect the fit outputs.